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12 Summary 

13 Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a 

H 14 linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating 

15 the noise level via the mean residual square and scaling the penalty in proportion to the estimated 

^ I g noise level. The iterative algorithm costs little beyond the computation of a path or grid of the 

CN| 17 sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the 

I g algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the 

j 19 regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled 

^rj 20 lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector 

• 21 satisfying certain oracle inequalities for prediction, the estimation of the noise level and the 

22 regression coefficients. These inequalities provide sufficient conditions for the consistency and 

23 asymptotic normality of the noise level estimator, including certain cases where the number of 
l— '24 variables is of greater order than the sample size. Parallel results are provided for the least squares 
^ 25 estimation after model selection by the scaled lasso. Numerical results demonstrate the superior 

>• 26 performance of the proposed methods over an earlier proposal of joint convex minimization. 

U 1 ~q Some key words: Convex minimization; estimation after model selection; iterative algorithm; linear regression; oracle 
inequality; penalized least squares; scale invariance; variance estimation. 

^30 
031 
^ 32 

T"!^ 1. Introduction 

. 34 This paper concerns the simultaneous estimation of the regression coefficients and noise level 

^ 35 in a high-dimensional linear model. High-dimensional data analysis is a topic of great current 

^ 36 interest due to the growth of applications where the number of unknowns far exceeds the num- 

37 ber of data points. Among statistical models arising from such applications, linear regression is 

38 one of the best understood. Penalization, convex minimization and thresholding methods have 

39 been proposed, tested with real and simulated data, and proved to control errors in prediction, 

40 estimation and variable selection under various sets of regularity conditions. These methods typ- 

41 ically require an appropriate penalty or threshold level. A larger penalty level may lead to a 

42 simple model with large bias, while a smaller penalty level may lead to a complex noisy model 

43 due to overfitting. Scale-invariance considerations and existing theory suggest that the penalty 

44 level should be proportional to the noise level of the regression model. In the absence of knowl- 

45 edge of the latter level, cross-validation is commonly used to determine the former. However, 

46 cross-validation is computationally costly and theoretically poorly understood, especially for the 

47 purpose of variable selection and the estimation of regression coefficients. The penalty level se- 

48 lected by cross-validation is called the prediction-oracle in Meinshausen & Buhlmann (2006), 
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who gave an example to show that the prediction-oracle solution does not lead to consistent 
model selection for the lasso. 

Estimation of the noise level in high-dimensional regression is interesting in its own right. 
Examples include quality control in manufacturing and risk management in finance. 



Our study is motivated by Stadler et al. (2010) and the comments on that paper by Anto- 



niadis (2010) and Sun & Zhang (2010). Stadler et al. (2010) proposed to estimate the regression 
coefficients and noise level by maximizing their joint log-likelihood with an l\ penalty on the 
regression coefficients. Their method has a unique solution due to the joint concavity of the 



log-likelihood under a certain transformation of the unknown parameters. Sun & Zhang (2010) 
proved that this penalized joint maximum likelihood estimator may result in a positive bias for 
the estimation of the noise level and compared it with two alternatives. The first is a one-step 
bias correction of the penalized joint maximum likelihood estimator. The second is an iterative 
algorithm that alternates between estimating the noise level via the mean residual square and 
scaling the penalty level in a predetermined proportion to the estimated noise level in the lasso 



or minimax concave penalized selection paths. In a simulation experiment Sun & Zhang (2010) 
demonstrated the superiority of the iterative algorithm, compared with the penalized joint max- 
imum likelihood estimator and its bias correction. However, no theoretical results were given 



for the iterative algorithm. Antoniadis (2010) commented on the same problem from a different 
perspective by raising the possibility of adding an i\ penalty to Huber's concomitant joint loss 
function. See, for example, section 7.7 of Huber & RonchettT|( 2009 1. Interestingly, the minimizer 
of this penalized joint convex loss is identical to the equilibrium of the iterative algorithm for the 
lasso path. Thus, the convergence of the iterative algorithm is guaranteed by the convexity. 



In this paper, we study Sun & Zhang (2010 1's iterative algorithm for the joint estimation of 
regression coefficients and the noise level. For the lasso, this is equivalent to jointly minimizing 
Huber's concomitant loss function with the l\ penalty, as Antoniadis ( j2010 ) pointed out. For 
simplicity, we call the equilibrium of this algorithm the scaled version of the penalized regression 
method, for example the scaled lasso or scaled minimax concave penalized selection, depending 
on the choice of penalty function. Under mild regularity conditions, we prove oracle inequalities 
for prediction and the joint estimation of the noise level and regression coefficients for the scaled 
lasso, that imply the consistency and asymptotic normality of the scaled lasso estimator for the 
noise level. In addition, we prove parallel oracle inequalities for the least squares estimation of 
the regression coefficients and noise level after model selection by the scaled lasso. We report 
numerical results on the performance of scaled lasso and other scaled penalized methods, along 
with that of the corresponding least squares estimator after model selection. These theoretical 
and numerical results support the use of the proposed method for high-dimensional regression. 

We use the following notation throughout the paper. For a vector v = (v±, 



) 1 / q denotes the £„ norm with the usual extensions 



max,- 



and 



,v P ) 
Mo : 



\"\q ~ 



Vj 7^ 0}. For design matrices X and subsets A of {1, . . . ,p}, Xj denotes column vectors of X 
and Xa denotes the matrix composed of columns with indices in A. Moreover, x + = max(x, 0). 



An iterative algorithm 



Suppose we observe a design matrix X = (x\, 



6 R nx P and a response vector y G R n . 



For penalty functions p(-), consider penalized loss functions of the form 



2n 



+ A 2 5>(|/3 i |/A) 



(1) 
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where /3 = (Pi, . . . , j3 p )' is a vector of regression coefficients. Let the penalty pit) be standard- 
ized to jo(0+) = 1, where p(t) = (d/dt)p(t). A vector (3 = (/3i, . . . , (3 P )' is a critical point of the 
penalized loss ([T]) if and only if 



x'Ay - XP)/n = Asgn(^)p(l^l/A), % + °> 



x'Jy-X/3)/ne A[-l,l], 



Pi = o. 



(2) 



If the penalized loss ([TJ is convex in /3, then (|2]) is the Karush-Kuhn-Tucker condition for its 
minimization. 

Given a penalty function p(-), one still has to choose a penalty level A to arrive at a solution of 
Q. Such a choice may depend on the purpose of estimation, since variable selection may require 
a larger A than does prediction. However, scale-invariance considerations and theoretical results 
suggest using a penalty level proportional to the noise level a. This motivates a scaled penalized 
least squares estimator as a numerical equilibrium in the following iterative algorithm: 



a 
A 



\y - 

aX 



AT^VUl-aM 1 / 2 , 



(3) 



(3 <- /3 new , L A (/3 new ) < L x (0 old ) 



where Ao is a prefixed penalty level, not depending on cr, a estimates the noise level, and a > 
provides an option for a degrees-of-freedom adjustment with a > 0. For p < n and (a, Ao) = 
(p/n,0), (pi) initialized with the least squares estimator /?( lsc ) is non-iterative and gives a 2 = 
\y — X f}^ se 'W/ \n —p). For large data sets, one may use a few passes of a gr adient descent 



algorit hm to compute /3 new from /? old . For a = 0, this algorithm was considered in Sun & Zhang 



is a 



(2010). In Sun & Zhang (2010) and the numerical experiments reported in Section 4, (3 
solution of (|2]) for the given A. We describe this implementation in the following twoparagraphs. 

The first step of our implementation is the computation of a solution path /3(A) of O) begi nning 
from / 3(A) = for A = ^'y/n^. For quadratic spline penalties p(t) with m knots, Zhang 
(2010 1 developed an algorithm to compute a linear spline path of solutions {A^ © /?W : t > 0} 
of (2 1 to cover the entire range of A. This extends the least angle regression solution or lasso 

1 and includes the minimax concave 



m 



pafh( |Osborne et al.[ |2000a|b[ |Efron et al.[ [20041 ) from : 
penalty for m = 2 and the smoothly clipped absolute deviation penalty (Fan & Li |2001[ ) for m = 
3. An R package named plus is available for computing the solution paths for these penalties. 

The second step of our implementation is the iteration (|3]> along the solution path /3(A) com- 
puted in the first step. That is to use the already computed 



/3 L 



/3(A) 



(4) 



in ([3]). For the scaled lasso, we use a = in ([3]) and p(t) = t in ([!]) and ([2]). For the scaled 
minimax concave penalized selection, we use a = and the minimax concave penalty p(t) = 
J*q(1 — x/7) + dx, where7 > regularizes the maximum concavity of the penalty. When 7 = 00, 
it becomes the scaled lasso. The algorithm ([3]) can be easily implemented once a solution path is 
computed. 

Consider the £\ penalty. As discussed in the introduction, ([3]) and Q form an alternating 
minimization algorithm for the penalized joint loss function 



Xf3\ 2 



2na 



> (1 
2 + — 



a)a 



+ A |/3h 



(5) 
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Antoniadis (2010) suggested this jointly convex loss function as a way of extending Hu- 
ber's robust regression method to high dimensions. For a = and A = a\o with fixed a, 
aL\ (l3,d) = L\(f3) + a 2 /2, so that /3 <— /3(A) in minimizes L\ (p,a) over /3. For fixed 
/3, u 2 i — \y — X/3| 2 /{( 1 — a)n} in ^ minimizes L\ ((3, a) over a. During the revision of this 
paper, we learned that She & Owen ( 201 1] ) have considered penalizing Huber's concomitant 
loss function for outlier detection in linear regression. We summarize some properties of the 
algorithm ([3]) with ([4]) in the following proposition. 

Proposition 1. Let ft = /3(A) be a solution path of (|2[) with p(t) = t. The penalized loss 
function |5]) is jointly convex in (/3, a) and the algorithm ([i]) with Q converges to 



(13, a) = argminL Ao (/3,cj). 

/3,<7 



(6) 



The resulting estimators /3 = [3(X, y) and a = &(X, y) are scale equivariant in y in the sense 
that (3(X, cy) = cf3(X, y) and a(X, cy) = \c\a(X, y). Moreover, 







L\ {f3(o-\ ),a} 



I- a \y-Xp{<j\ )\ 



(7) 



Since §5§ is not strictly convex, the joint estimator may not be unique for some data (X,y). 
However, since ((5} is strictly convex in a, a is always unique in ^ and the uniqueness of /3 
follows from that of the lasso estimator /3(A) at A = aAo; /3(A) is unique when the second part 
of Q is strict in the sense of not hitting ±A when (3j = 0, which holds almost everywhere in 
(X, y) for A > 0. See, for example, [Zhang] ( |20 1 0| ) . 
Let 5(A) = \y - X/3(A)| 2 /{(1 - c^n} 1 / 2 . For A = {(2/n) log p} 1 / 2 , Q implies that 

a = a(X), A = min{A:5 2 (A) < nA 2 /(21ogp)}. (8) 

While the present paper continues our earher work dSun & Zhang| |2010 ) by providing further 
theoretical and numerical justification s for ([3]) and (4i, the estimator has appeared in different 
forms. In addition to (|5]> and (|6]> of|Antoniadis1(|2010f7([8]) appeared in|Zhang|(|2010[). While this 



paper was in revision, a reviewer called our attention to |Belloni et alJJ2011[ ), who focused on 
studying /3 in an equivalent form as square-root lasso. We note that (jij and (Q allow concave 



penalties and degrees of freedom adjustments as in Zhang (2010). 



3. Theoretical results 
3-1. Analysis of scaled lasso 
Let j3* be a vector of true regression coefficients. An expert with oracular knowledge of /3* 
would estimate the noise level by the oracle estimator 

X/3*| 2 /n 1/2 . (9) 



a 



\y 



Under the Gaussian assumption, this is the maximum likelihood estimator for a when /3* is 
known and n(a* /a) 2 follows the Xn distribution. Due to the scale equivariance of a in Propo- 
sition 1 , it is natural to use a* as an estimation target with or without the Gaussian assumption. 
We derive upper and lower bounds for a /a* — 1 and use them to prove the consistency and 
asymptotic normality of a. We derive oracle inequalities for the prediction performance and the 
estimation of j3 under the l q loss. Throughout the sequel, pr/3 j(7 is the probability measure un- 
der which y - X(3 ~ JV(0, a 2 I n ). We assume that I Xj\ 2 — Th whenever prg a is invoked. The 
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asymptotic theory here concerns n — > oo and allows all parameters and variables to depend on 
n, including p >n> \f3\o — > oo. 

We first provide the consistency for the estimation of a via an oracle inequality for the pre- 
diction error of the scaled lasso. In our first theorem, the relative error for the estimation of a 



is bounded by a quantity tq related to a prediction error bound 77 (A, £, w, T) in ( 10 1 below. For 

A > 0, f > 1, w G R p , and T C {1, . . . ,p}, define 5 W;T = 1 — I(w = /3*, T = 0) and 

?7(A, C, w, T) = \XP* - Xw\l/n + (1 + 5 w , t )2\\wtAi + u j^NS ^ (10) 



where k(£, T), the compatibility factor ( van de Geer & Biihlmann[ 2009 1, is defined as 

\T\ x l 2 \Xu\ 2 



mm 



l n 1 



-/2 



: u E 



UT 1 



<T(£,T), u^o} 



(11) 



with the cone T) = {« : |ut c |i < £I u t|i}- Since the prediction error bound r/(X, £, w, T) 
is valid for all w and T, tq is related to its minimum over all w and T at the oracle scale a* : 



TO 



vl /2 {o-*X ,^)/a*, ?7*(A,£) = inf rj(\,£,w,T). 



(12) 



THEOREM 1. Lef ((3, a) be the scaled lasso estimator in /3* G R p , cr* f/ie oracle noise 
leveling, z* = \X'\y - X $*) / / a* and £ > 1. Wfen z* < (1 - t )A (£ -!)/(£ + 1), 



max ( 1 



(7 



1 



a 
a 



\Xf3-Xf3*\ 2 1 1/2/ ^A a 

- r °' T72~^ — V* (i »C) < T" 

n-V^er* cr* V 1 — tq / 1 



70 



(13) 

a" cr / n^/^a" o* vi — tq "/ i — tq 

In particular, ifXo = A{(2/n) logp} 1 / 2 with A > (£ + l)/(£ — 1) and r/*(crAo, C)/ " ~~ ^ 0> 

pr^ i(J (|a/a-l| >e) ^0 (14) 

for all e > 0. 

Theorem[T]extends to the scaled lasso a unification of prediction oracle inequalities for a fixed 
penalty. With A = cr*A /(l - r )+, O gives max{(cr*T ) 2 , \X/3 - Xfi*\l/n} < r/*(A, £), or 



max{(ff*T ) 2 , - Xf3*\l/n} < min{\Xw - Xp*\ 2 2 /n + 4CA ^min(A, \wj\)} (15) 



for a (7 > 1, if the minimum in ( 15 ) is attained at a w with (1 + 1/£) 2 k 2 (<!;, T) > 1/C, where 
T = {j : \wj\ > A}. This asserts that for an arbitrary, possibly non-sparse /?*, the prediction error 
of the scaled lasso is no greater than that of the best linear predictor Xw with a sparse w for an 
additional capped-^ cost of the order A Y^j min(A, \wj |). A consequence of this prediction error 
bound for the scaled lasso is the consistency of the corresponding estimator of the noise level in 
( 14 ). Due to the scale equivariance in Proposition [T] Theorem [T] and the results in the rest of the 
section are all scale free. 

For fixed penalty A, the upper bound t?(A, £, w, T) has been previously established for dif- 
ferent w and T, with possibly different constant facto rs. Examples include 7/(A,£, /3*, 



2A|/3*|i (Greenshtein & Ritov 2004} |Greenshtein 2006 1, r?(A, £, S>) < A 2 |/3*| with 5, 



{j : Wj ^ 0} ( van de Geer & Biihlmann[ |2009| >, and mm w r](\,£,w,S w ) = min w {\Xf3* 
Xw\l/n + Q(X z \w\o)} (|Koltchinskii et al.||2011^ . In ([To]), the coefficient for \Xw - Xf}*\\/n 
is 1 as in Kolt chinskii et al. ( 201 1 >. 
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Now we provide sharper convergence rates and the asymptotic normality for the scaled lasso 
estimation of the noise level a. This sharper rate A/i(A, £,)/a 2 , essentially taking the square of 



the order tq in ( 13 1, is based on the following t\ error bound for the estimation of /3, 

-|/3* c |! \\T\/{2(l-v)} 



m( A, £) = (£ + 1) min inf max 

T 0<v<l 

This l\ error bound has the interpretation 



+ u)/(l -v),T} 



|£-rii<M(A,£)<C^min(A,|/3* 



(16) 



(17) 



if C > (1 + max{2 , 1/K 2 (2g + 1,T)} withT = {j : \ /3*\ > A}. T his allows 0* to have many 
small elements, as in |Zhang & Huang[ ( |2008) , |Zhang| ( [2009] ) and |Ye & Zhang] pMty. The 
bound //(A, £)<(£ + l)\\Sp* |/{2k 2 (£, Sp*)} improves upon its earlier version in van de Geer 
|&Buhlmann](|2009[) by a constant factor 4£/(£ + 1) e (2, 4). 



THEOREM 2. Let (3, a, (3*, a*, z* and £ be as in Theorem^ Set r* = {X fi(a*X , 0/ a *} 1 ^ 2 - 
(i) The following inequalities hold when z* < (1 — t 2 )Aq(£ — !)/(£ + 1), 



max (1-5/(7*, 1-^/5) < r 2 , |/3 - p\ x < n(a*Xo,Z)/(l - n 2 ). 

(ii) Let A > {(2/n) log^/e)} 1 / 2 ^ + l)/{(^ - 1)(1 - r 2 )}. For all e>0 and n- 
\og{p/e) ->• oo, 

p 1>j(T {z* < (1 - r 2 )A (e - l)/« + 1)} > 1 - {1 + o(l)}e/{K\og{p/e)Yl\ 
I/Xq = A{(2/n) logp} 1 / 2 with A > + l)/(£ - 1) W X p(aX ,(,)/a <C n- 1 ^ 

n 1 / 2 ^/^-!) -)-JV(0,l/2) 

m distribution under pro* CT . 



(18) 

2 > 



(19) 



Since cr 2 r 2 w /i(A, < 2(^ + 1) min T r?(A, 2^ + 1, /3*,T) with A = cjAq, the rate r 2 in ( 18 1 



is essentially the square of that in ( [13] ), in view of ( |T2| ). It follows that the scaled lasso provides a 
faster convergence rate than does the penalized maximum likelihood estimator for the estimation 
of the noise level ( |Stadler et alj |20T0"t |Sun & Zhang||2l)T0"| ). In particular, ([18]) implies that 



max (1 - a/a*, 1 - a* J a) < (£ + l)A 2 |S r |/{2 K 2 (^, Sp. )} < | (logp)/ 



n 



(20) 



with Sp* = {j : /3* ^ 0}, when K (£,Sp*) can be treated as a constant. The bounds in (20) 



and its general version ([18]) lead to the asymptotic normality ( 19 1 under proper assumptions. 



Thus, statistical inference about a is justified with the scaled lasso in certain large -p-smaller-n 
cases, for example, when \/3*\o(\ogp)/y/n — > under the compatibility condition (van de Geer 
&Buhlmann [ [2009] ). 

For a fixed penalty level, oracle inequalities for the l q error of the lasso have been established 



in 



& Huang (2008) and 



Bunea et al.|(|2007i,|van de Geer|(|2008| ) and jvan de Geer & Buhlmanh](|20091) fo r q = l, |Zhang 



Bickel etal.|(|2009l for q 6 [1, 2], Meinshausen & Yu (2009 1 for q = 2, and 



Zhang] ([2009]) and | Ye & Zhang| ( |2010[ ) for q > 1. The bounds on a /a* in ([18) and <[20j) allow 
automatic extensions of these existing i q oracle inequalities from the lasso with fixed penalty to 



the scaled lasso. We illustrate this by extending the oracle inequalities of |Ye & Zhang| ( p010| ) 
for the lasso and Candes & Tao ( |2007| ) for the Dantzig selector in the following corollary. Ye & 
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Zhang (20101 used the following sign-restricted cone invertibility factor to separate conditions 
on the error y — X/3* and design X in the derivation of error bounds for the lasso: 



(21) 



where ^(^S) = {u : \u S c\i < ^ O^jx'.Xu < 0, for all j S 1 }. The quantity (21 1 



can be viewed as a generalized restricted eigenvalue comparing the l q loss and the dual norm 
of the £\ penalty with respect to the inner product for the least squares fit. This gives a di- 
rect connection to the Karush-Kuhn-Tucker condition Q. Compared with the restricted eigen- 
value ( Bickel et al.| 2009[ ) and the compatibility factor ( [TT] ), a main advantage of ( [21) is to 
allow all q 6 [1, oo]. In addition, (21 1 yields sharper oracle inequalities (Ye & Zhang 2010). For 
(|A|,|B|,|«| 2 ) = (\a\, [ft] , 1) with A n B = 0, define 



St = max { ± (iXAu/n 1 ^ - l) }, 9 a>b 



max \X' A XBu/n\ 



2' 



(22) 



The quantities in ( 22 1 are used in the uniform uncertainty principle ( Candes & Tao 2007 1 and the 



sparse Riesz condition (Zhang & Huang 2008 ). We note that 1 — 8 a is the minimum eigenvalue 



of X' a Xa/ti among \A\ < a, 1 + <5+ is the corresponding maximum eigenvalue, and d a f, is the 
maximum operator norm of size a x b off-diagonal sub-blocks of the Gram matrix X'X/n. 

COROLLARY 1. Suppose |/3£ c |i = 0. Then, Theorem [2] holds with p(X,S,) replaced by 
A|S|(20/{(£ + l)Fi(£,S')}, and for z* < (1 - t* 2 )A (£ - l)/(£ + 1), 

< kl/Q{a Z + J Xo) < „ (23) 



:i-r2)(c+i)F ff (e,5) 



/or all 1 < q < oo, where k = \S\. In particular, for £ = ^/2 and z* < (1 — r^)Ao(-v/2 — l) z 



l/3-/3*| 2 < 



(8£0 1/2 A o ^7(l 



< 



4fc 1 / : %a7(l-T?) 



(V2 + l)F 2 (x/2, 5) " (^2 + 1)(1 - «r fifc - 6, 



(24) 



The proofs of Theorems 1 and 2 are based on a basic inequality 



|X/3(A) - X/3*\l/n + |X/3(A) - Xw\%/n 



< \Xw 



Xp f \l/n + 2\{\w\ 1 



|0(A)|i} + 2aV>-/3(A)| ] 



(25) 

as a consequen ce of the Karush-Kuhn-Tucker co nditions ([2]). The version of (25 1 with w = f3* 
is well-known (van de Geer & Biihlmann 2009fr and controls \X/3(\) - Xj3*\% for sparse /3*. 
When \X/3(\) 



Xp*\\ > \Xw 



X^*\n, i25h controls the excess for sparse w by the same 



argument. The general w is taken in Theorem 1, while w = (3* is taken in Theorem 2. In both 



cases, (25 ) provides the cone condition in ( 1 1 1 and (21 1. This is used to derive upper and lower 
bounds for the derivative of the profile loss function L\ ((3(a\o), a) with respect to a, 
within a neighborhood of a /a* = 1. The bounds for the minimizer a then follow from the joint 
convexity of the penalized loss (|5]). 

3-2. Estimation after model selection 
We have proved that without requiring the knowledge of a, the scaled lasso enjoys prediction 
and estimation properties comparable to the best known theoretical results for known a, and the 
scaled lasso estimate of a enjoys consistency and asymptotic normality properties under proper 
conditions. However, the lasso estimator may have substantial bias (Fan & Peng 2004[ |Zhang| 
2010), and its bias is significant in our own simulation experiments. Although the smoothly 
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clipped absolute deviation and minimax concave penalized selectors were introduced to remove 
the bias of the lasso (Fan & Li 2001 Zhang| 2010| ), a theoretical study of their scaled version 
Q is beyond the scope of this paper. In this subsection, we present theoretical results for another 
bias removing method: least squares estimation after model selection. 

Given an estimator (3 of the coefficient vector f3, the least squares estimator of /3 and the 
corresponding estimator of the noise level a in the model selected by /3 are 



arg mm 



\y 



Xf3\l :supp(/3) Csupp(^)} : 



a 



\y - X/3 | 2 /V™, 



(26) 



where supp(/3) = {j : f3j ^ 0}. Alternatively, we may use a = \y — X/3 \ 2 /V( n ~ \ fl\o) to esti- 
mate the noise level. However, since the effect of this degrees of freedom adjustment is of smaller 



order than our error bound, we will focus on the simpler (26 1 



In addition to the compatibility factor k(£, S) in ( 11 1, we use sparse eigenvalues to study the 
least squares estimation after the scaled lasso selection. Let Amin (M) be the smallest eigenvalue 
of a matrix M and \ max (M) the largest . For models Tc{l,...,p}, define 



k_ (m, T) 



min A r 

BDT,\B\T\<m 



x {X' B X B /n), K + (m,T) 



min A r 

BnT0,\B\<m 



as the sparse lower eigenvalue of the Gram matrix for models containing T and the sparse up- 
per eigenvalue for models disjoint with T. Let S = supp(/3*) and S = supp(/3). The following 
theorem provides prediction and estimation error bounds for ( |26| > after the scaled lasso selection, 
along with an upper bound for the false positive \S \ S\, a key element in our study. 



THEOREM 3. Let (/3,a) be the scaled lasso estimator in (|6|) and (73, ex) the least squares 
estimator ( 26 ) in model S. Let (3* , a* , z* , £ and r* be as in Theorem |2j and m be an integer 
satisfying |S|£ 2 /k 2 (£, S) < m/K + {m, S). If z* < (1 - t* 2 )A (£ - l)/(£ + 1), then 

\S\S\< m, a 2 - {Cis + VV*(X0} 2 <° 2 < °\ 



(27) 



'm,S 



max BDS,\B\S\=m\(y ~ XP*) B \2/V n > and 

2 



with A = a\o < o-*A /(l - t 2 ) and «r* 

Mm " 1, S)\P - < \XP ~ X(3*\ 2 2 /n < {a* m _ ljS + 2^(\,0Y- ( 28 ) 

Moreover, in addition to the probability bound for z* < (1 — t 2 )Ao(£ — l)/(£ + 1) in Theorem^ 
( ii), for all integers 1 < m < p, 

pr^ )CT [< 5 (Vn)/a>V(m+|5|) + V{(2m)log(ep/m) + 21og(l/e)}] < (29) 

For Gaussian design matrices, the sparse eigenvalues 0) and K + (m, 0) can be treated 

as constants when m(logp)/n is small and the eigenvalues of the expected Gram matrix are 



uniformly bounded away from zero and infinity ( Zhang & Huang| 2008[ ). Since K_(m, S) > 
+ | iS' |,0) and S) < K + (m, 0), they can be treated as constants in the same sense in 

Theorem[3] Thus, for sufficiently small \ S\ (log p) /n, we may take an m of the same order as \S\. 
In this case, the difference between {13, a) and the scaled lasso estimator (f3,a) is of no greater 
order than the difference between (j3, a) and the estimation target (/3*, a*). Consequently, 



\a/a - l| + \p 



Xf3 - X(3*\ z Jn = P (l)\S\(logp)/n. 



As we have mentioned earlier, the key element in our analysis of (26 1 is the bound \S \ S\ < 
m in (27 1. Since this is a weaker assertion than variable consistency S = S, the conditions of 
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Table 1 : Performance of five methods in Example 1 at penalty levels Xj = 
{2 J_1 (logp)/n} 1 / 2 (j = 1,2,3), across 100 replications, in terms of the 
bias ( x 1 0) and standard error ( x 1 0) of ir / cr f or the selector and a /a for the 
least squares estimator after model selection, the average model size, and 
the relative frequency of sure screening, along with the simulation results 
in 



Fanet al. (2012) 









ro = 








ro =0-5 






Methnrl 




a/a 


a/a 


AMS 


SSP 




a/a 


AMS 


SSP 




Ai 


l-6±0-6 


-M±0-7 


7-6 


1-0 


l-5±0-6 


-l-0±0-7 


9-7 


10 


PMLE 


A 2 


2-5±0-6 


-0-l±0-6 


30 


1-0 


2-5±0-6 


-0-3±0-6 


5-2 


10 






3-6±0-7 


1-2±M 


1-8 


0-2 


3. 8±0-6 


— 0-2±0-6 


3.7 


10 




A, 


O5±0-6 


— 1-8±0-7 


11-9 


1-0 


0.o±0-6 


— 1-9±0-7 


15-5 


10 


BC 


A) 


l-6±0-6 


— 0-l±0-6 


3- 1 


1-0 


0-7±0-6 


_ 0-3±0-6 


61 


10 




A3 


3-3±0-7 


M±M 


1-9 


0-3 


l-9±0-7 


— 0-2±0-6 


4-2 


1 


Scaled 


A i 


0-0±0-6 


— 2-l±0-8 


14-6 


1 -0 


— 0-5±0-6 


— 2-3±0-7 


18-6 


10 


lasso 




1 ^-1-0 7 
1 ■ JztU / 


ft 9-1-0 A 
— U ZztU 


JV 1 










1 ft 




A 3 


3-l±0-7 


1-0±M 


1-9 


0-3 


l-2±0-7 


-0-2±0-6 


4-4 


10 


Scaled 


Ai 


-l-2±0-8 


-2-4±0-8 


14-1 


1-0 


-0-7±0-6 


-2-2±0-8 


13-8 


10 


MCP 


A 2 


-0-l±0-6 


-0-l±0-6 


3-1 


1-0 


0-1 ±0-6 


-0-2±0-6 


3-2 


10 




A 3 


l-5±l-3 


0-6±M 


2-4 


0-6 


0-6±0-7 


-01±0-6 


3-0 


10 


Scaled 


Ai 


-0-6±0-6 


-2-2±0-7 


14-0 


1-0 


-0-4±0-6 


-2-2±0-8 


13-9 


10 


SCAD 


A 2 


0-8±l-0 


-0-l±0-6 


3-1 


1-0 


0-4±0-6 


-0-3±0-6 


3-8 


10 




A 3 


3-l±0-7 


0-9±l-l 


2-0 


0-3 


l-2±0-7 


-0-2±0-6 


3-8 


10 


N-LASSO 




-5-3 ±2-0 




36-6 


1-0 


-4-6 ±2-0 




29-6 


10 


RCV-SIS 




0-2 ± 1-4 




500 


0-9 


-0-1 ± 1-4 




50-0 


10 


RCV-ISIS 




0-5 ± 1-7 




30-9 


0-7 


0-2 ± 1-2 




290 


0-8 


RCV-LASSO 




0± 1-3 




31-1 


0-9 


-0-3 ± 1-1 




26-5 


10 


P-SCAD 




-1-4 ± 1-1 




300 


1-0 


-1-2 ± 1-7 




29-9 


10 


CV-SCAD 




0-7 ± 1-2 




300 


1-0 


0-9 ± 1-3 




29-9 


10 


P-LASSO 




-0-8 ± 2-1 




36-5 


1-0 


-0-9 ± 1-5 




29-6 


10 


CV-LASSO 




1-4 ± 1-1 




36-5 


1-0 


0-8 ± 1-0 




29-6 


10 



PMLE, l\ penalized maximum likelihood estimator; BC, bias-corrected estimator; MCP, minimax concave 
penalty; SCAD, smoothly clipped absolute deviation penalty; N, naive; RCV, refitted cross-validation; SIS, sure 
independent screening; ISIS, iterative SIS; P, plug-in method with degrees-of-freedom correction; CV, cross- 
validation; AMS, average model size; SSP, relative frequency of sure screening 



Theorem [3] on the design matrix is of a weaker form than the irrepresentability condition for 
variable selection consistency dMeinshausen & Buhlmann 2006 ; Zhao & Yiij |2006| ). In Zhang 
& Huang| ( |2008| ) and |Zhang| ( |2010| ), upper bounds for the false positive were obtained under a 
sparse Riesz condition on «;_(m, 0) and K+(m, 0). 
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4. Numerical results 
4-1. Simulation study 

In this section, we present some simulation results to compare five methods: the scaled pe- 
nalized methods with the l\ penalty, the minimax concave penalty and the smoothly clipped 
absolute deviation penalty, the t\ penalized maximum likelihood estimator ( Stadler et al.| 2010[ ), 



and its bias correction ( [Sun & Zhang| 2010| ). The least squares estimator after model selection 



by these five methods is also studied. The penalized maximum likelihood estimator is 



2 



2a 2 



n a 



or equivalently the limit of the iteration a <— W(y — Xj3)/n\ 1 / 2 and f3 <— (3(a\o). The bias- 
corrected estimator is one iteration of ([5]) with Q from (/3( pml °), cr( pmle )) with a = 0, 

a( bc ) = \y- P(a^ mle h )\ 2 /n 1/2 , ^ (bc) = ^(? (bc) A ). 

Two simulation examples are considered. 

Example 1. We compare the five estimators at three penalty levels Xj = - v /{2- ? ^ 1 (logp)/n}, 
j = 1, 2, 3. The experiment has the setting of Example 2 in |Fan et"aL ( 2012[ ), with the small- 



est signal, b = l/\/3. We provide their description of the simulation setting in our notation 
as follows: X has independent and identically distributed Gaussian rows with marginal distri- 
bution N(0, 1), corr(xj, Xj) = ro for 1 < i < j < 50 and corr(xj, Xj) = otherwise, (n,p) = 
(200, 2000), nonzero coefficients Pj = l/y/3 for j £ S = {1, 2, 3}, and y - Xf3 ~ JV(0, a 2 1) 
with cr = l. Two configurations are considered: independent columns Xj with tq = and cor- 
related first 50 columns Xj with vq = 0.5. We set 7 = 2/(1 — max \x' k Xj\/n) for the concave 
penalties. 

The top section of Tab le [T] present s our simulation results, while the bottom section includes 



the simulation results of Fan et al. (2012 1 for several joint estimators of (ft, a) using cross- 
validation, without repeating their experiment. In addition to the bias and the standard error of 
the ratios a /a for the five original estimators and a fa for the least squares estimation after model 
selection, w e report the ave rage model size |5| and the relative frequency of sure screening, 



S 5 S, as in Fan et al. (2012), where S = supp(/3) is the selected model. 

Without post processing, the scaled minimax concave penalized selector with the universal 
penalty level A2 = y/{(2/n) logp} clearly outperforms other procedures in this example. How- 
ever, the results of the least squares estimation after model selection at penalty level A2 are nearly 
identical to the top performer for all five methods. In view of the results in average model size 
and sure screening proportion, the success of post processing at A2 is clearly due to the success 
of model selection. The five methods select too few variables at the larger penalty level A3 and 
too many at the smaller Ai, both leading to substantial bias in the estimation of a for ro = 0. For 
ro = 0.5, selecting a slightly smaller model does not harm so much since a substantial portion of 
the effect of the missing variables is explained by the selected variables correlated to them. The 
minimax concave penalized selector is nearly unbiased in this example, so that it does not need 
post processing. Cross-validation methods select about 30 variables when the true model size is 
3. This over selection is probably the reason for the large bias for most cross-validation methods 
and large standard error for all of them. 



Example 2. This experiment has the same setting as in the simulation study in Sun & Zhang 
(2010), where the scaled lasso and the scaled minimax concave penalized selection are called the 
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48 1 

Table 2: Performance of five methods in Example 2 at penalty levels Xj = 
{2 J '~ 1 (logp)/ra} 1 / 2 (J = 1,2,3), across 100 replications, in terms of the 
^ bias (xlO) and standard error (xlO) of a/a and a /a, the false positive, 

and the false negative 



487 








r = 0.1 








r = 0.9 






488 


Method 




a/a 


a/a 


FP 


FN 


a/a 


a/a 


FP 


FN 


489 






















490 




Ai 


5.5±0.3 


0.2±0.3 


3 


11 


2.4±0.3 


-0.4±0.3 


4 


12 


PMLE 


A, 


7.7±0.4 


2.1±0.6 





19 


3.7±0.3 


-0.3±0.3 


1 


15 


4y i 




A, 


9.5±0.4 


6.3±1.1 





30 


5.7±0.3 


-0.1±0.3 





20 
























493 




Ai 


3.2±0.3 


-0.3±0.4 


8 


9 


0.3±0.3 


-0.7±0.3 


9 


11 


494 


BC 


A, 


6.1±0.5 


1.6±0.5 





17 


1.2±0.3 


-0.3±0.3 


2 


13 


495 




A 3 


9.1±0.5 


6.1±1.1 





29 


3.1±0.4 


-0.1±0.3 





18 


496 


Scaled 


Ai 


1.9±0.4 


-0.8±0.4 


14 


7 


0.0±0.3 


-0.8±0.3 


11 


11 


497 


lasso 


A, 


5.0±0.5 


1.3±0.5 





16 


0.6±0.3 


-0.3±0.3 


2 


13 


498 




A 3 


8.9±0.6 


6.0±1.2 





29 


1.8±0.4 


-0.2±0.3 





16 


499 






















500 


Scaled 


Ai 


-0.2±0.4 


-1.1±0.4 


14 


7 


0.0±0.3 


-0.7±0.3 


10 


19 


501 


MCP 


A, 


1.8±0.5 


0.7±0.5 





13 


0.6±0.3 


-0.2±0.3 


1 


20 




A 3 


7.8±1.0 


5.6±1.3 





28 


1.7±0.4 


0.0±0.3 





22 


502 






















503 


Scaled 


Ai 


0.6±0.4 


-1.0±0.4 


14 


7 


0.0±0.3 


-0.8±0.3 


10 


15 


504 


SCAD 


A 2 


4.7±0.6 


1.3±0.5 





16 


0.6±0.3 


-0.3±0.3 


2 


15 


505 




A 3 


8.9±0.6 


6.0±1.2 





29 


1.8±0.4 


-0.2±0.3 
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506 PMLE, l\ penalized maximum likelihood estimator; BC, bias-corrected estimator; MCP, min- 

507 imax concave penalty; SCAD, smoothly clipped absolute deviation penalty; FP, false positive 

508 FN, false negative 

509 
510 



naive estimators. We provide the description of the simulation settings in |Sun & Zh ang (2010]) in 
our notation as follows: (n,p) = (600, 3000), the Xj are normalized columns from a Gaussian 

523 random matrix with independent and identically distributed rows and correlation ro ' ' between 

524 the j-th and fc-th entries within each row, 7 = 2/(1 — max \x' k Xj\/n) for the minimax concave 

525 penalty and smoothly clipped absolute deviation penalty, the nonzero j3* are composed of five 

516 blocks of /3*(1, 2, 3, 4, 3, 2, 1)' centered at random multiples ji, . . . , j'5 of 25, /3* sets = 

517 3n, and y — X(3* is a vector of independent and identically distributed iV(0, 1) variables. Thus, 

518 the true noise level is a = 1. We set ro = 0.1 for low correlation between design vectors and 

519 ro = 0.9 for high correlation. 

520 We summarize the simulation results in Table [2] which provides the bias and standard er- 

521 ror of the ratios a /a for the selector and a /a for the least squares estimator after model se- 

522 lection, the false positive \S \ S\, and the false negative \S \ S\. Without post processing, the 

523 scaled lasso outperforms the penalized maximum likelihood estimator and its bias correction, 

524 which are also based on the lasso path. However, the scaled lasso estimate of a is still biased, 

525 and the level of bias is comparable with the order of the error bound (|5|/n) logp = 0.47 in 



526 (20). This and the failure in sure screening by any method reflect the difficulty of this example, 

527 where IS"! = 35 is not small and the signal is weak, with average /3* = 0.11 and 0.05 respec- 

528 tively for ro = 0.1 and ro = 0.9. From this perspective, the scaled minimax concave penalized 
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529 i 

Table 3: Estimated coefficients (xlO ) of selected probe sets by four methods in the real data 

example: the lasso with cross-validation, the lasso with adjusted cross-validation, and the scaled 

^2 lasso and minimax concave penalized selection at Ao = A2 = {2(logp)/n} 1 / 2 

533 
534 
535 
536 
537 
538 
539 
540 
541 
542 
543 
544 
545 
546 
547 
548 
549 
550 
551 
552 
553 
554 
555 
556 
557 
558 
559 
560 
561 
562 
563 
564 
565 
566 

C-V lasso/LSE, the lasso estimator with the adjusted cross-validation; MCP, minimax concave penalty; #cov, 
<-^g the number of covariates considered; A, probes not in the smaller set of 200 probes; * , covariates selected by 

stability selection 

569 
570 

571 selection, designed to reduce the bias of the lasso, performs quite well at the universal penalty 

572 level A2 = y/{(2 /ri) log p}, especially with post processing. The least squares estimation af- 

573 ter model selection reduces the bias substantially in all cases, even without successful model 

574 selection. This example seems to suggest the possibility of improving the performance of the 

575 scaled estimators at a penalty level A smaller than the universal penalty level A2, a simple upper 

576 bound for \X'(y — XP)/(a*n)\ 00 under prg . However, consistent variable selection requires 



Probe ID 


C-V lasso 


C-V lasso/LSE 


Scaled lasso 


Scaled MCP 


#cov 


200 


3000 


200 


3000 


200 


3000 


200 


3000 


1369353_at 


—9 • 12 


—7-13* 


—7-09 


—2-79* 


-7-3 


—4-03* 






1370052_at A 




3-65 














1 370429 _at 


—3-22 




—8-94* 


— 11-06 


—8-78* 


—9-36 


— 16-37* 




1371242_at 


—6-66 
















1374106 _at 


8-88* 


10-58* 


7-33* 


614* 


7-45* 


7-01* 


8-47* 


10-02* 


1374131_at 


4-07 


0-80 














1375585_at A 












0-58 






1384204_at 






0-70 




0-70 








1387060_at A 




3-50* 














1388538_at A 




1-42 














1389584_at 


17-16* 


25-39* 


20-07* 


19-61* 


19-97* 


21-18* 


45-75* 


50-49* 


1393979_at 


— 1-81 




—0-22 




—0-4 








1379079_at A 




—1-43* 














1 379495 _at 




4-84 


1-73 




1-71 


1 • 00 






1 37997 l_at 


13-56* 


13-1 


11- 19* 


8-81 


11-25* 


9-52 






1380033_at 


8-69 




2-76 




2-97 




6-75* 




1380070_at A 




0-19 














1381787 _at 


-2-05 




-2-01 




-2-11 








1382452_at A 




12-93 








1-63 




12-91* 


1382835_at 


12-64 


5-79 


3-73 




4-15 








1383110 at 


Q HI* 

y-Vo 


1 Q QQ 

Vy-yy 


1 ^ 1 n* 


1D-4J 


1 A QT* 




1 J-oU 




13s3522_at 


3-03* 




* 




* 








1383673_at 


5-54 


6-12* 


6-07 


6-15* 


6-08 


6-47* 






1 383749 _at 


-13-86 


-10-85* 


-10-84 


-6-7* 


-11-02 


-8-07* 


-2-74 * 


-1-11* 


1383996_at 


25-01* 


17-82* 


18-61* 


14-30* 


18-88* 


15-52* 


25-07* 


19-19* 


1385687_at A 




-0-99 














13 86683 _at 








4-60* 




2-90* 






1390788_a_at 


0-92 
















1392692_at A 




1-74 














1393382_at 


2-43 
















1393684_at 


1-59 
















1395076_at A 












0-23 






1397489_at A 




3-33 














Model size 
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14 


7 


6 


A = ctAq 


0.0103 


0.0163 


0.025 


0.035 
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600 
601 
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200 covariates being considered 



3000 covariates being considered 



0.04 0.06 
lambda 



Fig. 1: Mean squared prediction error against penalty level: solid curve, 
testing error of the lasso for fixed A; dotted curve, training error of the 
lasso for fixed A; dot-dashed line, testing error of the scaled lasso with 
fixed Aq = y/{(2/n) logp}, or equivalently the lasso at penalty level aAo- 



A > \X'(y — Xf3*)/{a*n)\ 00 as Example 1 demonstrates. Since the scaled lasso estimator a is 
an increasing function of the penalty level by Proposition [T] it is always possible to reduce the 
bias of a to zero by taking a specific A for each specific example. However, the two examples in 
our simulation experiment demonstrate the difficulty of picking such a penalty level consistently. 



4-2. Real data example 



We study a data set containing 18976 probes for 120 rats, which is reported in Scheetz et al. 



(2006). Our goal is to find probes that are related to that of gene TRIM32, which has been found 
to cause Bardet-Biedl syndrome, a genetically heterogeneous disease of multiple organ systems 
including the retina. We consider linear regression with the probe from TRIM32, 1389163_at, as 



the response variable. As in Huang et al. ( 2008 1, we focus on 3000 probes with the largest vari- 



ances among the 18975 covariates and consider two approaches. The first approach is to regress 
on these p = 3000 probes. The second approach is to regress on the 200 probes among the 3000 
with the largest marginal correlation coefficients with TRIM32. For the cross-validation lasso, 
we randomly partition the data 1000 times, each with a training set of size 80 and a validation set 
of size 40. For each partition, the penalty level A is selected by minimizing the prediction mean 
squared error in the validation set. Then we compute the lasso estimator with all 120 observa- 
tions at the penalty level equal to the median of the selected penalty levels with the 1000 random 
partitions. Since cross-validation tends to choose a larger model, we also consider an adjusted 
version using the cross-validated error of the least squares estimator after the lasso selection. For 
the minimax concave penalty, we set 7 = 2/(1 — 00.95) = 6.37, where 00.95 is the 95% quantile 
of \x' k Xj\/n. 

Table[3]shows the probe sets identified by four methods: the cross-validation lasso, its adjusted 
version, the scaled lasso at at universal penalty level A2 = {2(logp)/n} 1 / 2 , and the minimax 



concave penalized selection at the same penalty level. We apply stability selection ( |Meinshausen 
& Buhlmann 2010 1 to check the reliability of selection. Let Wi, . . . , W p be independent vari- 
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pmle 



Scaled lasso 



Scaled MCP 



Table 4: Prediction performance of eight methods in the real data exam- 

626 pie at penalty levels A = Xj = {2 j ~ 1 (logp) /n} l / 2 (j = 1,2, 3), in terms 

6 2 ^ of the prediction mean squared error (xlO 2 ), estimated model size, and 

6 2 ^ correlation coefficient (x 10 2 ) between fitted and observed responses 

629 

630 

631 

632 Method 

633 
634 
635 
636 

637 BC 
638 
639 
640 
641 
642 
643 
644 
645 
646 

647 Scaled SCAD 

648 
649 
650 
651 
652 
653 

PMLE, £i penalized maximum likelihood estimator; BC, bias-corrected PMLE; 
MCP, minimax concave penalty; SCAD, smoothly clipped absolute devia- 

655 tion penalty; C-V lasso/LSEl, the lasso with adjusted cross-validation; C-V 

656 lasso/LSE2, the least squares estimator after the lasso selection with adjusted 

657 cross-validation, #cov, the number of covariates considered; corr, the correla- 
65g tion coefficient between fitted and observed responses; P-MSE, prediction mean 

squared error 

660 

661 ables with P(W = 0.2) = P(W = 1) = 1/2 and 

662 p 

6« r = argmin + A £ \P 3 \/W 3 , 

664 b 2n ^ 

665 

666 where A is the penalty level chosen by individual methods. Stability selection selects variables 

667 with nonzero estimated over 50 times in 100 replications. We observe that the scaled min- 

668 imax concave penalized selector produces most sparse and most stable selection, followed by 

669 the adjusted cross-validation, the scaled lasso and then the plain cross-validation. The selection 

670 results are consistent among the four methods in the sense that the selected models are almost 

67 1 nested. Since the model size is between 6 and 8 by stability selection in all 8 cases and by the 

672 scaled minimax concave penalized selection for both p = 200 and p = 3000, these two methods 



C-V lasso 
C-V lasso/LSEl 
C-V lasso/LSE2 
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1-01 
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673 provide most consistent results. The scaled lasso and the adjusted cross-validation yield identical 

674 lasso and stability selections for p = 200 and identical stability selection for p = 3000. 

675 We also compare the prediction performance of the scaled lasso with that of the lasso with 

676 the best fixed penalty level. We compute the scaled estimators in 1000 replications. In each 

677 replication, the dataset is split at random into a training set with 80 observations and a test set 

678 with 40 observations. The prediction mean squared error is computed within the test set, while 

679 the scaled estimators and the lasso estimator with fixed penalty level A are computed based on 

680 the training set. Figure [T] demonstrates that in prediction, the scaled lasso with Ao chosen as 

681 A2 = {2(logp)/n} 1 / 2 performs almost as well as the lasso with the optimal fixed A. 

682 In addition, we compare the prediction performance of all the estimators mentioned in this 

683 section. In each replication, we compute the penalized maximum likelihood estimator, its bias- 

684 correction, and scaled penalization methods based on the training set of 80 observations. For 

685 cross-validation, the training set of 80 observations is further partitioned at random 100 times into 

686 two groups of sizes 60 and 20, and a penalty level is selected by minimizing the estimated loss in 

687 the smaller group for the lasso estimator based on the larger group. This selected penalty level is 

688 then used for the lasso with the entire training set. Thus, the cross-validation lasso is also based 

689 on the training set with 80 observations. For the penalty level selected by the adjusted cross- 

690 validation, two estimators are considered: the lasso estimator and the least squares estimator 

691 after the lasso selection. In Table |4j we present the medians of the prediction mean squared error 

692 and the selected model size in the 200 replications. The scaled lasso has comparable prediction 

693 performance as cross-validation. Again, Table [4] suggests that original cross-validation tends to 

694 choose larger models, while adjusted cross-validation leads to results comparable with the scaled 

695 lasso. 
696 

697 

69g 5. Discussion 

699 In the theoretical analysis, we have considered Ao = A{(2/n) logp} 1 / 2 with A > 1. This 

700 choice is somewhat conservative from a number of points of view. Simulation results suggest 

701 that the requirement A > 1 is a mathematical technicality. If iX'e/nloo < A* with large prob- 

702 ability for a standard normal vector e, the theoretical results in this paper are all valid under 

703 P r /3,cr when Ao is replaced by the smaller min(Ao, A\*). The value of A* can be estimated by 

704 simulation with the given X and separately generated e. A s omewhat sha rper theoretical choice 



705 of Ao is A{(2/n) log(p/ s)} 1 / 2 with the unknown s = |/3*|o (Zhang 2010 1, or its simulated ver- 

706 sion with A* = maxm =s iX^e^/lT) 1 / 2 . The difference between the two Ao is limited unless 

707 logp = {1 + o(l)} log n. A reviewer called our attention to an unpublished 201 1 report by Ba- 

708 raud, Giraud and Huet, available at http://arxiv.org/abs/1007.2096, whose method can be used to 

709 select a penalty level to nearly minimize the order of a penalized prediction error. This may also 

710 justify the use of smaller estimated penalty levels. 

711 In the proof of our theoretical results for the scaled lasso, we use oracle inequalities for fixed 

712 penalty which unify and somewhat sharpen existing results. We now present this result. Define 



(30) 



714 r,*(\,0 = mm 2" 1 [r/(A, £, (3\ T) + {r] 2 (X, £, /3*,T) - 16A 2 |/^ C | 2 } 1/2 

715 

as a sharper version of rj[X, £, /3*,T) in ( | 10[ >. 

7 1 1 THEOREM 4. Let /3(A) be the minimizer o/([7]) with p(t) = t. Let (3* G R p be a target vector 

andi > 1. Then, in the event \X'(y - Xf3*)\ OQ /n < A(£ - l)/(£ + 1), we have 

720 \Xp(\)-Xp*\ 2 2 /n<mm{ V *(\,0,V*(\0} (31) 
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with 7/* (A, £) in ( 12 1. Moreover, in the same event and with //(A, £) in {16), 

i^(A)-rii<MA,o- 



(32) 



The interpretations of (31 1 and (32 1 are given in ( 15 I and ([17]), along with their relationship 



to several existing results. We note here that the condition S)xl for ( 15 I and ( 17 ), weaker 
than the parallel condition on the restricted eigenvalue ( [Bickel et al.[|2009 1, can be slightly weak- 
ened by using F x (f , S) in <[2TJ \ Ye & Zhang"} |2010[ ). 
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Appendix 

Here we prove Proposition [T] Theorem|4] Theorem[T] Theorem [2] and then Theorem[3] 
Proof of Proposition^ (i) Since P = f3(a\ ) is a solution of ([2|> at A = ctAq, 



{(d/dw)L Xo (w,a) } 



= 0, 



for all (3j(a\o) ^ 0. Since {j : Pj(X)} is unchanged in a neighborhood of ctA , [(d/dcr){P(aXo) /cr}] 3 
for Pj(a A ) = 0. Thus, 



~L Xo 0(aX o ),a} = ^L Ao {^(aA ), t} 



I- a \y- Xp(aX )\ 2 2 
2 2n<r 2 



(ii) The convergence of ^ and Q follows from the joint convexity of L\ {fi, a). The scale invariance 
follows from L (c(3, ca; X, cy) = cL (/3, a; X, y), where L (f3, a; X, y) expresses the dependence of ^ 
on the data (X, y). □ 

Proof of Theorem^ (i) Let (3 = /3(A). Since a*z* = \X'(y - X/3*)U/n and p(\Pj\/X) = 1 for % ^ 
0, the inner product of w — p and the Karush-Kuhn-Tucker condition O) yield 

{XP-Xw)'{Xp-XP*)/n < X(\w\i - + a*z*\w - %. 

Since 2( Xp - Xw)'{XP - X/3*) = \ X(3 - Xw\\ + \Xh\\ - \X(3* - Xw\ 2 2 , this gives the basic in- 
equality J25l. Let h = /3- p*. Since a* z* < A(£ - !)/(£+ 1), \{\w\i - \0\i} + a*z*\w - % is no 



greater than b\(w - P) T \i + 2X\w T -\i - {b/£)\(w - P)t°\i with 6 = 2£A/(£ + l). Thus, (25 1 implies 



\xp-Xw\l/n+\Xh\l/n + (2b/0\(w-P) T <>\i < 2c + 2b\{w - /3) T |i 



(Al) 



with c = \XP* - Xw\l/{2n) + 2X\w T <\i. For T = and w = P*, (Al 1 directly yields \Xh\\jn < c 



2X\P*\\. For general {w, T}, we want to prove 

\Xh\ 2 2 /n < r,(X, £, w, T) = 2c + b 2 /a, a = K 2 (£, T)/\T\. 
It suffices to consider \Xh\ 2 /n > 2c. In this case, P — w G "if (£, T) by ( Al I, so that by |TT} 

a\(w-P) T \l<\XP-Xw\l/n. (A2) 
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Let x = \(w — P)t\i and y = \Xh\ 2 /n. It follows from ( Al i and ( A2i that ax 2 + y < 2c + 2bx. For such 



(x, y), y — 2c< max x {2bx — ax 2 } = b 2 /a. This gives y < 2c + b 2 /a = r](X, £, w, T) 

For w = /3*, it suffices to consider the case y > c = 2A|/Jj. c |i, where the cone condition holds for 



P*. Now, (x, y) satisfies ax 2 < y < c + bx. The maximum of y, attained at ax 2 = c + bx, is 

c + b{b + (b 2 + 4ac) 1 / 2 }/(2a) = [r?(A, £, (3* , T) + {r? 2 (A, £, P*,T) - 4c 2 } 1 / 2 ] /2. 



(A3) 



(ii) Let < v < 1 and T C {1, . . . ,p}. It follows from ( Al i with w — ft* that 



(l+0\Xh\ 2 2 /n + 2X\h T .\ 1 < 2A(f + l)|)8J.|i + 2fA|ft 



T 1- 



It suffices to consider v\h\x > (£ + l)|/3y c |i. In this case 

(l + £)|JOi|l/n + 2A(l-z/)|/i r =|i < 2\{i + v)\h T \i. 



Thus, (1 - v)\h T c\x < (£ + ^)|/it|i, or equivalently h G <?f{(£ + v)/(l - f),T}. It follows from ([TT 
that |X/i|l/n > \h T \ln 2 {{£, + u)/(l - v),T)/\T\, so that 



(1 + 0IMi« {(£ + ")/(l - "),r}/|T| + 2(1 - *,)A|Mi < 2(€ + f)A|/»r|j 



(A4) 



Let x = \hr\i and y = |/it c |i- Write (A4i as ax 2 + by < ex. Subject to this inequality, the maximum 
of x + y is max 1 >Q{3; + (cx — ax 2 )/b}. This maximum, attained at 2ax = b + c, is x(b + c)/(2b) = 
{b + c) 2 /(4ab). Thus, 



Hi < 



{2(£ + l)A} 2 |T| 



4(1 + 0« 2 {(£ + f )/(l - n{2(l - ^)A} 2k»{(£ + „)/(! — j/),T} 
This gives </i(A,Oforj/|/i|i > + WUi- 



□ 



Proo/o/r/zeorem[7J Assume ro < 1 without loss of generality. Consider t > a*(l — To) and the 
penalty level A = tX for the lasso. Since z*a* < a* (I - mUnfg - l)/(£ + 1) < A(£ - l)/(£ + 1) and 
a* = \y — XP*\2/n 1 / 2 , the Cauchy-Schwarz inequality and (31 1 imply 



\\y- Xp{tX )\ 2 /n l ' 2 -a*\ < \Xp(tX ) - Xph/n 1 ' 2 < r)l /2 (tX , £)• 
Since rjl^ 2 (tXo, £) < o-*tq for i < a*, the derivative of the loss with a = satisfies 



2f 



L Ao {/3(U ),t} = i 2 - |y - X/3(iA )||/n < t 2 - (a*) 2 (l - r ) 2 = at i = <7*(1 - r ). 



This implies a > a* (I — tq) by the strict convexity of the profile loss ^ in a. For £ > cr*, r}^ 2 {tXo, £) < 
tr by dTol and (fl2|, so that at t = a* /{I - r ), 



t 2 - \y - Xf3(tX )\l/n > t 2 - (cr* + <t ) 2 > 0. 
This implies a* > a(l — tq) by the strict convexity of <JsJ in a. Thus, the first part of ( 13 i holds. Moreover, 

\Xp- Xp*\ 2 /n 1 ' 2 < 7,y 2 (5Ao,0 < r?* V2 KA /(l - t ),£} < cr*r /(l - r ). 
Finally, since pr ftCT [|X'(2/ - X/3)Hoo < cr{(2/n) logp} 1 / 2 ] ->• 1, |l4|) follows from |l3}. □ 



SMC/? 



(A5) 



The proof of Theorem 2 requires the following lemma. 

LEMMA 1 . Let T m have the t-distribution with m degrees of freedom. Then, there exists e T , 
that for all t > 

pr[T 2 > m{e 2 * 2 /(™-D - 1}] < (1 + , m )e^j^l 2 t). 
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817 Proof of LemmaU] Let x = [m{e 2t2/( - m ~ 1 '> - 1}] 1/2 . Since T m has the t -distribution, 

818 

8i9 ^ ^ „a ^{( m +i)m r u x _!r (w+1)/a dtt 



P \ m > * J rfm/2)fm7r)i/2 I V + m J 



820 V m ' r(m/2)(m7r) 1 /2 

821 < 2r{(m+l)/2} r / ^ -^/^ 

82 2 ~ xT{m/2)(rmr) 1 / 2 J x \ mJ 

823 2r{(m + l)/2}m ^ i x 2 \ -(m-i)/2 



824 a;r(m/2)(m7r) 1 /2( m _ l) 



ry( 1 + ^) 



825 Since x > i{2m/(m - l)} 1 ^, 

826 — 7/ 

827 /> ^ 2 \ . x/2r{(m + l)/2} e-* 2 _ 

828 vs\l m >x ) < r(m/2 ) (m _ i)i/2^i/2 -U + ^^i 

829 

830 where e m = {2/(m - l)} 1/2 r{(m+ 1)/2}/T(to/2) - 1 ->• as m ->■ oo. □ 

83 1 

Proof 0/77zeorem|2] We need to express r 2 as a function of a at er = a* in the proof. Define 

833 M \ X f X fV ^ , 

834 = A ^(^o,OM 0+ = + = £ + 1 ■ 
835 

g36 We have r 2 = 0(cr*) < 1, 0_ < 0(er*) and 0+ < 0(er*)/(l - 0(cr*). 

837 (i) Consider at* < (1 - <£_)A (f - l)/(f + 1). Let A = iA and h = 0(X)-P*. Since 

g^8 -^"/3*)/n|oo = 2*<7*, the Karush-Kuhn-Tucker condition ([2]) gives 

839 
840 
841 
842 

843 as lower and upper bounds for (er*) 2 — \y — X/9(A)||/n. This is a key point in the proof. 

844 For t>a*{l- z*cr* < iA (£ - l)/(£ + 1) = A(£ - l)/(£ + 1), so that {32} in Theorem|4]im- 

845 plies \h\i < fj,(tX ,0- It follows {All} that for t = a* (I - </>_), 
846 

847 i 2 - 12/ - X/3(tA )|l/n < t 2 - (a*) 2 + 2zV>(tA , 

848 < 2*(* - O + 2i V£ - 1)« + I)" VKAo, = 0, 

849 due to 0_ = (£- 1)(£ + ly 1 ^*) = (£- + l) _1 A /x(o-*Ao,£)/o-*. As in the proof of Theorem 

850 [I] 

we find a /a* > 1 — 0_ by (j7j) and the strict convexity of |5]l in a. 



{z*a* + A)|/i|i < (A^)'{y - + y - X0(\)}/n 
= {a*) 2 -\v-Xp(\)\l/n 

= (Xh)'{2{y - X/3*) - Xh}/n < 2z*a*\h\ 1 (A6) 



851 Now we prove that a/a* < 1 + <j> + . For t > a*, /Lt(tA ,0 < (t/a*)fj,(a* A o, b Y (jD- Thus > since 

852 (f - l)/(£ + 1) + 1 = 2cj) + {l - 4>{a*)}/4>{a*) and 0+ < (1 + for t/a* = 1 + 0+, {A6i 

853 and Q imply that 
854 
855 
856 
857 
858 

859 It follows that a /a* < 1 + 4> + by convexity. 

860 since 1 - < d/a* < 1 + </>+, ^(ctAo) — /3*|i < ^(S'Aq,^) < ^(cr*A ,?)(l + 0+). This com- 

861 pletes the proof of (|T8]>. 

862 (ii) Let Zj — x'j{y — Xf3*)/(na*) with z* — maxj< p \zj\. Under pTg, (J , e* = y — X/3* is a vector of 

863 independent and identically distributed normal variables with zero mean. Since a* — \y — Xfi*\/n 1 / 2 , 

864 — z j)/i n ~ l)} 1 ^ 2 follows a i-distribution with n—1 degrees of freedom. Lemma[T|with m = 



t 2 -\y- Xf3(t\ )\ 2 /n > t 2 - (a*) 2 - (z*a* + t\ )fj,(t\ , 

> (t + a*)a*<P+ - {(£ - 1)/(C + 1) + 1 + 0+}tAoM(ff*Ao, 

= (a*) 2 ((2 + 0+)0 + - [20 + {l - + <f>+](l + </>+)<t>(a*)) 

= (a*) 2 + {0(a*)(l + + )-0 + }>O. 
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n — 1 and t 2 = log(p/e) > 2 implies 



Wf>*,<r 



{n~l)z 2 
l-z 2 



> ( n _l){ e 2* 2 /(»-2)_ 1} 



1 + e n -i t 2 _ (1 + e n -i)e/p 



{K\og{p/e)y/ 2 - 



Since e a - 1 < J2T=i a k /2 k ~ 1 = a/(l - a/2) for any < a < 2, 

(n lUe 2tV(n-2) _ u < 2(n-l)t 2 /(n-2) 2(n-l)t 2 /n 
( ,{ f ~ l-t 2 /(n-2) ~ \-2t 2 /n ' 

The combination of \A1\ and ( |A8[ ) yields 

pr^^O^-l > {21og(p/ e )/n} 1 /2] = p V)g | _Li > L_ _2 / | 



(A7) 



(A8) 



< 



-2i 2 /r 

(n — l),z 2 at 2 ~i 

^>(n-l)(e^-l)} 



r l 1 



< {l + e n ^){e/p)/{n\og{ple)Y' 2 . 

Since A > {(2/ra) log^/e)} 1 / 2 ^ + l)/{(£ - 1)(1 - 0-)}, this bounds the tail probability of z* = 
maXjXp | by the union bound. Since n(a* /a) 2 follows the \n distribution, n x l 2 {o* jo — 1) converges 
to N{6, 1 /2) in distribution, which then implies (Tl9} by (fi~8]l under <j)(a) = o(n~ 1/2 ). □ 



Proof of Theorem^ Let h = /3 — /3* . It follows from the proof of Theorem [2] (i) that z* < 
(l-^_)Ao(f-l)/(£ + l) < (ct/<7*)A (^-1)/(^ + 1), so that A + zV <((A-zV). By 
|a^X/i/n| = |a^(y-X/3-£*)/n| > X-z*a* for ^ 0. Let BCS\S with |S| < m. Since 
K+(m,S) is the upper sparse eigenvalue, (A — z*a*) 2 \B\ < K + (m, S)\Xh\ 2 /n. By the basic in- 
equality (25 1 with w — (3*, \Xh\ 2 /n < (A + z*a*)\hs\i and h is in the cone S). Thus, since 
\hs\ 2 K 2 (^S) < \Xh\ 2 \S\/n by jn), |Xh| 2 /n < (A + z*a*) 2 \S\/n 2 ^, 5). It follows that \B\ < 
K+(m,S)£ 2 \S\/n 2 (£„S) < to. Since all B C 5\ S" of size |5| < to have size \B\ < to, 5\ 5 does not 
have a subset of size to. This gives the first inequality in ( f2"7j ). 

Let Pg be the orthogonal projection to the linear span of (xj, j € B). By the definition of a* n s and 

the prediction error bound \Xh\ 2 /n < rj„ (A, £) in Theorem^ 

d 2 - a 2 = \P s (y - XP)\l/n < (|P§ £*| 2 + \PsXh\ 2 ) 2 /n < {a* m _ hS + M% 0} 2 . 
This gives the second inequality in p7) . The prediction error bound follows from 

\X~B- Xf3*\ 2 = \P § y - Xf3*\ 2 < \P s (y - Xp)\ 2 + \Xh\ 2 < {a* m _ hS + 207*(A,O», 
which implies the l 2 estimation error bound due to k_(to — 1, S)\j3 — 0*\ 2 < \Xj3 — X(3*\ 2 /n. Finally, 



the probability bound in ( 29 1 follows directly from an application of the Gaussian concentration inequality 
to the chi-squared variables in the union bound. □ 
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